ABSTRACT

The HemaExplorer (http://servers.binf.ku.dk/hemaexplorer) is a
curated database of processed mRNA gene expression profiles (GEPs)
that provides an easy display of gene expression in haematopoietic
cells. HemaExplorer contains GEPs derived from mouse and human
haematopoietic stem and progenitor cells as well as from more
differentiated cell types. Moreover, data from distinct subtypes of
human acute myeloid leukemia are included in the database, allowing
researchers to directly compare gene expression of leukemic cells
with that of their closest normal counterpart. Normalization and
batch correction make the data in the database directly comparable
between samples. The HemaExplorer has a comprehensive visualization
interface that can make it useful as a daily tool for biologists and
cancer researchers to assess the expression patterns of genes
encountered in research or literature. HemaExplorer is relevant for
all research within the fields of leukemia, immunology, cell
differentiation and the biology of the haematopoietic system.



INTRODUCTION 

Haematopoiesis is the developmental process by which all blood cells
are formed. The haematopoietic system is organized in a hierarchical
manner with the haematopoietic stem cell (HSC) residing at the apex
(Figure 1). HSCs have the ability to self-renew for life and to
differentiate into intermediate progenitor cells that subsequently
generate the mature cells of the blood stream and immune system (3).
Thus, HSCs are ultimately responsible for the constant and life-long
generation of intermediate progenitors and terminally differentiated
cells. Genetic and epigenetic aberrations may lead to a block in
normal haematopoietic differentiation, which may result in the
development of acute myeloid leukemia (AML), an aggressive blood
cancer.

Given the general interest in haematopoiesis and its associated
malignancies such as AML, a number of databases now address questions
related to these fields, most of which focus on mRNA expression data.
Multiple tools are available online that provide a quick overview of
the expression of single genes in a selection of tissues as well as
some sorted cells. The requirement for purified sorted cell
populations in the study of haematopoiesis makes the curation of the
database, as well as the availability of a reasonable selection of
haematopoietic cells, important. Examples of general and non-curated
databases with a focus on representing gene expression based on mRNA
microarrays are Gene Expression Atlas (GXA) (4), GeneNote (5) and
BioGPS (6).

GXA provides a count table of over- and under-expression of a query
gene in a wide range of cell types and experimental setups, presently
counting 3476 experiments of various types. The count numbers are
based on differential expression within each microarray dataset in a
selection of publicly available microarray data. GeneNote provides a
bar chart with a continuous ratio scale axis, displaying expression
intensities pre-processed and scaled to a common mean in a global
normalization. The database contains a coherent collection of
expression profiles in 28 healthy human tissue types. BioGPS provides
a summary of single genes including a chart of gene expression based
on a single microarray dataset. The variety of cell or tissue types
depends on the selected dataset.

Figure 1. A model of murine haematopoiesis (1). Dotted line represents self-renewal. Dashed line indicates the separation of MkE lineage from the
lymphoid and the remaining myeloid populations as shown by Adolfsson et al. (2). Abbreviations are: LT-HSC, long-term haematopoietic stem cell;
ST-HSC, short-term haematopoietic stem cell; LMPP, lymphoid-primed multipotential progenitors; CLP, common lymphoid progenitor cells; CMP,
common myeloid progenitor cell; MEP, megakaryocyte-erythroid progenitor cell; preGMP, pre-granulocyte monocyte progenitors; GMP, granulocyte
monocyte progenitors; MkEP, megakaryocyte erythroid precursors; MkP, megakaryocyte precursor; PreCFU-E, pre-colony-forming unit erythroid
cells; CFU-E, colony-forming unit erythroid cells; NK cells, natural killer cells.

Databases dedicated to research in the haematopoietic system and
haematopoietic disorders include Leukemia Gene Atlas (7),
BloodExpress (8) and Haematopoietic Fingerprint (9). The Leukemia
Gene Atlas comprises 20 datasets from leukemia research experiments;
13 of those sets are gene expression profiles (GEPs). The database
provides the possibility of running a number of standardized gene
expression analysis methods on a selected dataset, including
value-sorted bar plots of expression.

BloodExpress is a database of 271 microarray-based experiments and
provides a list of datasets where a query gene is 'present' in terms
of a significant P-value and presence on the microarray chip.



The Haematopoietic Fingerprints database contains data
from a mouse dataset with 20 samples, and can provide 
bar charts of gene expression in haematopoietic stem cells 
and mature populations of erythrocytes, granulocytes, 
monocytes, NK cells, activated and naive T cells, and 
B cells. The database also provides an overview of expres- 
sion across the chromosomes for each cell type. 

The time and expense necessary for sorting cells drastically reduce
the availability of data for investigation of haematopoiesis, as
compared to whole tissue samples. This is particularly the case for
stem and early progenitor cell populations. None of the present
databases integrates and compares GEPs from haematopoietic cells
derived from several labs and experiments. Furthermore, there are no
publicly available databases with the possibility of visualizing gene
expression in a range of sorted cells from both the human and the
murine haematopoietic system, and a direct comparison with human AML
in such a normal context presently requires extensive computational
or experimental work from the researcher.

Here, we present the first publicly available, fully integrated and
curated database of GEPs of sorted cells from stem, progenitor and
mature populations important for haematopoietic development in both
mouse and human. Compared to the present databases, this approach
provides a far more complete view of expression in the haematopoietic
system. The interface allows for easy visualization of the expression
of single genes and provides tools to assess the correlation of two
genes across multiple cell types. Furthermore, the database includes
bone marrow from four karyotypically distinct AML subtypes and
provides two options for investigating malignant haematopoiesis
compared to the normal context.

MATERIALS AND METHODS 

The database is built from a curated selection of publicly available
datasets containing GEPs from sorted cell populations. The database
integrates several Affymetrix platforms (HG-U133A, HG-U133B and
HG-U133 Plus 2; Mouse Genome 430 2.0). For the human data, the probes
that overlap across the platforms populate the database. The mouse
database is based solely on the Mouse Genome 430 2.0 platform. Data
were separated into batches divided by platform and laboratory. The
batches were normalized by robust multi-array average (RMA) using the
affy package for R (10) and corrected for batch effects using ComBat
(11). ComBat is an empirical Bayes (EB) method that proceeds in three
steps. First, it standardizes the data by a gene-wise least-squares
fit. Second, it estimates EB batch effect parameters by the method of
moments, using Gaussian and inverse gamma prior distributions.
Finally, the data are adjusted using the estimated batch effect
parameters. This procedure leaves all data in the database directly
comparable between samples within the human and murine parts of the
database, respectively.
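
The three steps can be illustrated with a much simplified sketch in
Python with NumPy. It is not ComBat: it performs a plain
location-and-scale adjustment per gene and batch, without the
empirical Bayes shrinkage and without biological covariates, and it
is not the code used to build the database. The function name
location_scale_adjust and its arguments are assumptions made for
illustration.

```python
import numpy as np

def location_scale_adjust(expr, batches):
    """Remove per-batch shifts and scalings, gene by gene.

    expr    : genes x samples array of log-scale expression values
    batches : one batch label per sample (each batch needs >= 2 samples)
    """
    expr = np.asarray(expr, dtype=float)
    batches = np.asarray(batches)

    # Step 1: standardize every gene across all samples.
    grand_mean = expr.mean(axis=1, keepdims=True)
    pooled_sd = expr.std(axis=1, ddof=1, keepdims=True)
    pooled_sd[pooled_sd == 0] = 1.0  # guard against constant genes
    z = (expr - grand_mean) / pooled_sd

    # Steps 2 and 3: estimate a shift and a scale per gene and batch,
    # then remove them. ComBat would shrink these estimates towards
    # batch-wide priors before adjusting; this sketch uses them raw.
    adjusted = np.empty_like(z)
    for batch in np.unique(batches):
        cols = batches == batch
        shift = z[:, cols].mean(axis=1, keepdims=True)
        scale = z[:, cols].std(axis=1, ddof=1, keepdims=True)
        scale[scale == 0] = 1.0
        adjusted[:, cols] = (z[:, cols] - shift) / scale

    # Return to the original expression scale.
    return adjusted * pooled_sd + grand_mean
```

After this adjustment every batch has the same per-gene mean, which
is the property that makes samples from different platforms and
laboratories comparable. The shrinkage step that the sketch omits is
what makes ComBat robust for small batches.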

From the normalized and batch-corrected data a new dataset was built
containing fold changes between human AML and the closest normal
counterpart, as identified by projecting the data onto a reduced space
defined by principal component analysis and determining Euclidean
distance by the k-nearest neighbour method (Rapin et al., manuscript
in preparation). The model parameters are constant for the present set
of normal sorted cells, and addition of future AMLs will not influence
the existing samples.
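
Since the full method is described elsewhere, the following Python
sketch only illustrates the general idea under stated assumptions:
the principal components are fitted on the normal samples alone (so
that adding AML samples cannot change them), the closest cell type is
chosen by majority vote among the k nearest normal samples, and the
fold change is taken as a difference of log2 values. The function
names, the choice of two components and k = 3 are illustrative
assumptions, not the published parameters.

```python
import numpy as np
from collections import Counter

def closest_normal_type(normal, labels, aml, n_components=2, k=3):
    """Return the cell type of the normal samples closest to one AML sample.

    normal : samples x genes array of log2 expression (normal cells only)
    labels : cell type of each normal sample
    aml    : log2 expression vector of one AML sample
    """
    normal = np.asarray(normal, dtype=float)
    aml = np.asarray(aml, dtype=float)

    # PCA fitted on the normal samples only, via SVD of the centred data.
    center = normal.mean(axis=0)
    _, _, vt = np.linalg.svd(normal - center, full_matrices=False)
    axes = vt[:n_components]

    # Project both the normal samples and the AML sample.
    normal_pc = (normal - center) @ axes.T
    aml_pc = (aml - center) @ axes.T

    # k nearest normal samples by Euclidean distance, then majority vote.
    dist = np.linalg.norm(normal_pc - aml_pc, axis=1)
    nearest = np.argsort(dist)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

def log2_fold_change(normal, labels, aml, cell_type):
    """log2 fold change of an AML sample against the mean of one cell type."""
    normal = np.asarray(normal, dtype=float)
    mask = np.array([label == cell_type for label in labels])
    return np.asarray(aml, dtype=float) - normal[mask].mean(axis=0)
```

Fitting the projection on normal cells only mirrors the statement
that the model parameters are constant for the present set of normal
sorted cells.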

In order to handle gene aliases, a dictionary of gene aliases was
constructed from NCBI (ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/) and
the HUGO Gene Nomenclature Committee (HGNC) (www.genenames.org).
Ambiguous gene aliases were not included when constructing the
dictionary.
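
A minimal sketch of such a dictionary in Python is shown below. The
record layout (an official symbol with a list of aliases), the
case-insensitive matching and the rule that official symbols take
precedence over aliases are assumptions for illustration; the gene
names in the usage example are placeholders, not real annotations.

```python
from collections import defaultdict

def build_alias_dictionary(records):
    """Map each unambiguous alias to its official gene symbol.

    records : iterable of (official_symbol, [alias, ...]) pairs
    """
    records = list(records)

    # Official symbols always resolve to themselves.
    official = {symbol.upper(): symbol for symbol, _ in records}

    # Collect every gene that claims a given alias.
    claimed_by = defaultdict(set)
    for symbol, aliases in records:
        for alias in aliases:
            claimed_by[alias.upper()].add(symbol)

    # Keep an alias only if exactly one gene claims it and it does not
    # collide with an official symbol.
    dictionary = dict(official)
    for alias, symbols in claimed_by.items():
        if alias not in official and len(symbols) == 1:
            dictionary[alias] = next(iter(symbols))
    return dictionary

def resolve(query, dictionary):
    """Return the official symbol for a query, or None if unknown or ambiguous."""
    return dictionary.get(query.strip().upper())
```

For example, with the placeholder records
[("GENEA", ["ALPHA", "SHARED"]), ("GENEB", ["BETA", "SHARED"])],
the alias "alpha" resolves to "GENEA", whereas "SHARED" is claimed by
two genes and is therefore left out of the dictionary.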

The database structure is a combination of Python pickle and HDF5
files. The main data handling and graphic construction were written
in Python with NumPy and in R, called from Python via the rpy2
interface (12). Figure 2 shows a diagram of the database structure
and the query handling for a single-gene lookup (ERG) (13).

The database is continuously updated with new data as they become
available. The present selection has focused on introducing as little
batch effect as possible, and thus on including as few platforms and
data sources as possible.

RESULTS AND DISCUSSION 

The HemaExplorer has an easy-to-use interface with a single query
field and the possibility to select between several data sources. At
present, the database has available: human normal hematopoietic
system, human AML, human normal hematopoietic system + human AML,
human AML compared with closest normal counterpart, and mouse
haematopoietic system. Cell types and data sources are listed in
Table 1. The database provides a plot of expression for query genes
and accepts unambiguous gene aliases. In the case of ambiguous gene
aliases, a deep link to genecards.org (24) with the query is
constructed at the interface level, to allow the user to identify the
relevant gene symbol.

Currently the interface allows visualization of the expression of
single genes in haematopoietic cells, and any two genes can also be
plotted in a pair-wise correlation plot. It is possible to include
expression in bulk bone marrow of four karyotypes of AML: AML with
t(8;21), AML with t(15;17), AML with inv(16)/t(16;16) and AML with
t(11q23)/MLL. Furthermore, the interrelationship between these four
AML karyotypes can be investigated in a plot displaying their fold
change with respect to the closest normal counterpart. In an advanced
mode the user can select which of the available haematopoietic cell
types are displayed in the output plot, for single genes as well as
for two genes in a correlation plot.

Database 

The overall integrity of the database is confirmed by a
dimensionality reduction technique (principal component analysis), as
well as by correlation analysis of expression profiles by
hierarchical clustering, visualized in heatmaps (Supplementary
Figures S1 and S2). Both validation techniques show a separation of
cell types into well-defined clusters, and reveal no remaining batch
effects. The plots are generated using the probe set with the highest
mean expression, but normalized expression for all probes annotated
to the query gene can be downloaded as a flat file on the result
page.
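
The probe set selection rule can be sketched in a few lines of
Python. The function name select_probe_set and the probe identifiers
in the usage note are placeholders for illustration; how the
HemaExplorer code itself performs this step is not described here.

```python
import numpy as np

def select_probe_set(probe_ids, expr):
    """Pick the probe set with the highest mean expression for one gene.

    probe_ids : identifiers of the probe sets annotated to the gene
    expr      : probe sets x samples array of normalized expression
    """
    expr = np.asarray(expr, dtype=float)
    means = expr.mean(axis=1)          # mean over all samples, per probe set
    best = int(np.argmax(means))       # first one wins in case of a tie
    return probe_ids[best], expr[best]
```

Given two placeholder probe sets "probe_a" and "probe_b" with
expression rows [5.0, 6.0, 7.0] and [8.0, 9.0, 7.0], the function
returns "probe_b" together with its expression row, since its mean is
higher.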

Example of correlation plot between two genes 
The possibility of visualizing correlation between two genes makes it
possible to identify genes that show similar expression across cell
types. Furthermore, this feature enables researchers to visualize
characterizing gene pairs with expression patterns that may separate
cell populations into distinct clusters and could therefore
potentially improve cell sorting strategies.
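
What such a comparison measures can be sketched in Python as follows.
The sketch computes a Pearson correlation coefficient over all
samples and the mean position of each cell type in the plane spanned
by the two genes. Whether the HemaExplorer plot reports these
quantities is not stated here, so both the function
gene_pair_summary and its outputs are assumptions for illustration.

```python
import numpy as np

def gene_pair_summary(x, y, cell_types):
    """Summarize the joint expression of two genes across samples.

    x, y       : expression of gene 1 and gene 2, one value per sample
    cell_types : cell type label of each sample
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    labels = np.asarray(cell_types)

    # Pearson correlation between the two genes over all samples.
    r = float(np.corrcoef(x, y)[0, 1])

    # Mean position of every cell type in the two-gene plane; cell types
    # with well separated positions are candidates for sorting markers.
    centroids = {}
    for cell_type in sorted(set(cell_types)):
        mask = labels == cell_type
        centroids[cell_type] = (float(x[mask].mean()), float(y[mask].mean()))
    return r, centroids
```

Two genes whose cell-type positions fall into clearly separated
groups behave like the surface markers of a gating strategy, which is
the use case discussed in the example that follows.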

In Figure 3, an example of a FACS analysis for murine haematopoietic
cells is depicted together with expression correlation plots from
HemaExplorer. Gene expression in the correlation plots corresponds to
the genes that encode the surface markers used to gate cell
populations in the FACS sorting strategy, performed according to
Pronk et al. (1). The FACS data and gates in Figure 3A and B
represent external biological samples for which there are no
expression data in the database, and serve illustrative purposes
only.

Figure 2. Diagram of database construction and output from query. Top-down: The database is built from publicly available data, normalized by RMA
and batch corrected by ComBat (11). Right-to-left: Queries are sent to the gene alias dictionary, and relevant gene expression data are retrieved from the
repository. The plot is generated from gene expression data together with figure text and a list of abbreviations used in the plot, and formatted into a result page.

Table 1. Cell types and sources of data currently in the HemaExplorer database

| Samples | Species | Source | Acc. number | Ref. |
|---|---|---|---|---|
| MY, pbPMN, bmPMN, PM | Human | GEO | GSE19556 | (14) |
| Monocytes | Human | ArrayExpress | E-MEXP-1242 | (15) |
| HSC BM, early HPC BM | Human | GEO | GSE2666 | (16) |
| CMP, GMP, MEP | Human | GEO | GSE19599 | (17) |
| HSC BM | Human | GEO | GSE17054 | (18) |
| Monocytes | Human | GEO | GSE11864 | (19) |
| Monocytes, B cells, CD4+, NK, CD8+, Eosinophils, mDC, Neutrophils, pDC | Human | GEO | GSE28490 | (20) |
| Leukemia samples | Human | GEO | GSE13159 | (21,22) |
| CD4+, CFU-E, CLP, ETP, GM, preGMP, IgM+SP, LMPP, LT-HSC, MkE, MkP, NK, PreB, PreCFU-E, ProB, ProE, ST-HSC | Mouse | GEO | GSE14833 | (23) |
| NK, CD8+ naive, CD4+ naive, CD4+ act., CD8+ act., B-cell, Monocytes, Granulocytes, Nucl. Erythrocytes | Mouse | GEO | GSE6506 | (7) |

Abbreviations are: MY, myelocytes; pbPMN, polymorphonuclear cells from peripheral blood; bmPMN, polymorphonuclear cells from bone marrow;
PM, promyelocytes; HSC BM, haematopoietic stem cells from bone marrow; early HPC BM, early haematopoietic progenitor cells from bone marrow;
CMP, common myeloid progenitor cells; GMP, granulocyte macrophage progenitor cells; MEP, megakaryocyte-erythroid progenitor cells; CD4+, CD4
positive T-cells; CD8+, CD8 positive T-cells; mDC, myeloid dendritic cells; pDC, plasmacytoid dendritic cells; CFU-E, colony-forming unit erythroid
cells; CLP, common lymphoid progenitor cells; ETP, early thymic progenitor cells; preGM, pre granulocyte monocyte progenitor cells; GMP, granulocyte
monocyte progenitor cells; IgM+SP, IgM positive spleen cells; LMPP, lymphoid-primed multipotential progenitor cells; LT-HSC, long-term haematopoietic
stem cells; MkE, megakaryocyte erythroid cells; MkP, megakaryocyte precursor cells; NK, natural killer cells; preB, pre-B cells; PreCFU-E,
pre-colony-forming unit erythroid cells; ProB, B-cell progenitor cells; ProE, erythroid progenitor cells; ST-HSC, short-term haematopoietic stem cell.

The comparison between a FACS sorting strategy and 
the expression correlation plots from HemaExplorer 
shows similar clustering of cell types. 



Example of single gene lookup 

In Figure 3E, a single gene lookup for BMI-1 is depicted. The BMI-1
gene is known for its role in proliferation, senescence and
self-renewal under various conditions (25-27). The expression plot
for BMI-1 shows high expression for immature cells, gradually
decreasing for cell types found in later stages of the myeloid
haematopoietic differentiation pathway. BMI-1 expression in the four
AML karyotypes included in the database has levels comparable to the
more immature stages, suggesting that the bulk of cells in leukemic
bone marrow tends to be less mature and expresses a more stem-like
phenotype.

Example of comparison between human AML and the 
closest normal counterpart 

In Figure 3F, a single gene lookup is depicted for the query gene
VEGFA (vascular endothelial growth factor A). The plot shows the fold
change between the AML and the closest normal counterpart. The plot
shows a clear difference in mRNA expression of VEGFA between two
groups: AML with t(8;21) (AML1_ETO) and AML with t(15;17) (APL) on
one side, and AML with inv(16)/t(16;16) and AML with t(11q23)/MLL on
the other. This, and similar distinct separations of AML karyotypes
when they are compared to their closest healthy counterpart, can
provide researchers with valuable research targets when investigating
differences in gene expression between discrete AML subtypes relative
to normal cells.



In conclusion, the HemaExplorer is a curated, normalized and
batch-corrected database of GEPs in normal and malignant
haematopoiesis in human and mouse. The easy-to-use interface allows
for a quick lookup of the expression levels of a gene and of the
correlation of expression between pairs of genes. Full integration
and comparability of data collected from several sources extend the
scope and the possible overview of single mRNA expression in
haematopoiesis, compared to the presently publicly available
databases. Moreover, the HemaExplorer contains four karyotypes of
human AML that can be put in the context of the normal haematopoietic
system when visualizing gene expression. Therefore, HemaExplorer will
be of use to researchers within the fields of leukemia, immunology,
cell differentiation and the biology of the haematopoietic system as
a powerful, easily accessible tool for the assessment of gene
expression.



SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online:
Supplementary Figures 1 and 2.

Figure 3. (A-D) FACS analysis of a subpopulation of murine bone marrow cells depicted together with HemaExplorer gene expression correlation
plots. Correlation between gene expression is shown for genes encoding the markers used in the FACS plot shown to the right. (A) FACS gating
strategy for separating GMP from CFU-E, preCFU-E and preGM cells. (B) FACS gating strategy for separating CFU-E, preCFU-E and preGM
cells. (E) Single gene lookup in the HemaExplorer of the expression of BMI-1 in AML and normal sorted cell populations from the myeloid
compartment. (F) Single gene lookup in the HemaExplorer of the VEGFA gene. The dataset is human AML compared with the closest healthy myeloid
differentiation stage and presents bone marrow samples from AML patients. Abbreviations are: HSC_BM, haematopoietic stem cells from bone
marrow; early HPC_BM, haematopoietic progenitor cells from bone marrow; CMP, common myeloid progenitor cell; GMP, granulocyte monocyte
progenitors; MEP, megakaryocyte-erythroid progenitor cell; PM_BM, promyelocyte from bone marrow; MY_BM, myelocyte from bone marrow;
PMN_BM, polymorphonuclear cells from bone marrow; PMN_PB, polymorphonuclear cells from peripheral blood; AML1_ETO, AML with t(8;21);
APL, AML with t(15;17).